Introduction

CAMP software has been used in a variety of areas, but at the end of TIPSTER it finishes as it started: as a coreference annotation system. The coreference output has been used to participate in MUC-6 and MUC-7, has served as the foundation for three types of summarization engines, and has been input to a cross-document coreference system for names and events. This document focuses on the most successful of these applications: a query-sensitive summarization system and a cross-document coreference system.

Dynamic Coreference-Based Summarization

We have developed a query-sensitive text summarization technology well suited to the task of determining whether a document is relevant to a query. Enough of the document is displayed for the user to determine whether the document should be read in its entirety. Evaluations indicate that summaries are classified for relevance nearly as well as full documents. This approach is based on the concept that a good summary will represent each of the topics in the query, and it is realized by selecting sentences from the document until all the phrases in the query which are represented in the summary are ‘covered.’ A phrase in the document is considered to cover a phrase in the query if it is coreferent with it. This approach maximizes the space of entities retained in the summary with minimal redundancy. The software is built upon the CAMP NLP system [3].

Problem  Statement 

Given the relative immaturity of summarization technologies and their evaluation, it is worthwhile to describe our approach in detail and the problems it is intended to solve. An important aspect of our technique is that we produce sentence extraction summaries, which are constructed by selecting sentences from the source document. In addition, our summaries are focused on providing relevant information about a query. We feel that the current state-of-the-art techniques are better equipped to produce high quality query-sensitive summaries than generic summaries. Our goal is to produce ‘indicative’ summaries [5] which allow a user to determine whether the document is relevant to his or her query. The summary is not intended to replace the document or provide answers to questions directly, but may have this effect.

Casting our technology in terms of a product, we see the application as an intermediate step between viewing entire documents and the output of an information retrieval engine. Instead of looking at either headlines or an entire document, the user would look at the summaries of the documents and then decide whether the document merited further reading.

Approach 

We conducted a simple experiment with summaries produced in the TIPSTER summarization dry run [8]. For 5 queries with 200 documents each, we took the set of summaries produced by the 6 dry-run participants and retained only those summaries that were true positives, i.e., the summary was judged ‘relevant’ and the full document was judged ‘relevant’. Over all the queries, at least one of the six systems produced a true-positive summary for 96.6% of the documents, although no individual system performed nearly at that level. This meant that some existing technology produced a correct summary for almost every relevant document. Hence we viewed the problem as one of balancing the capabilities of our system to behave like the amalgamated system implicit in the joined output. Based on this result we are confident that this class of summarization is tractable with current technologies, and this has strongly motivated our design decisions.

Upon encountering a query like “Reporting on possibility of and search for extra-terrestrial life/intelligence.”, we assume that the user has defined a class of actions, ideas, and/or entities that he or she is interested in. The job of an information retrieval engine is to find instantiations of those classes in text documents in some database. We view summarization as an additional step in this process where we attempt to present the user with the smallest collection of sentences in the document that instantiate the user-specified classes and do not mislead the user about the overall content of the document. By doing so, we can greatly shorten the amount of the document that the user must read in order to determine whether the document is relevant for the user’s needs.

Just as information retrieval algorithms approximate document relatedness by examining various string matchings between the query and the text, we approximate certain classes of coreference between the query and the text by examining linguistic information. These coreference relations include identity of reference and part-whole relations for nominal and verbal phrases (see footnote 1). This moves us a step closer to reasoning at a level of generalization that is more appropriate for summarization yet still technologically feasible. Below are examples indicating the classes of relatedness that we are trying to capture.

The identity relation between the query and the document

Noun  phrase  coreference  is  the  best  understood 
class  of  relations  that  we  compute.  For  example, 
there  is  coreference  between  ‘Federal  Emergency 
Management  Agency’  in  the  query  and  the  acronym 
‘FEMA’  in  the  document  below: 

Query: What is the main function of the Federal Emergency Management Agency and the funding level provided to meet emergencies?

Document: . . . FEMA agrees that “fine-tuning” is needed to the 1974 act establishing a coordinated federal program to prepare for and respond to hurricanes, tornadoes, storms and floods. . . .

Footnote 1: It is not clear whether more sophisticated annotations are appropriate for information retrieval, and perhaps more to the point, it is not clear that there are sufficient resources to process 2 GB collections of data.

Since these noun phrases refer to the same entity in the world, sentences that mention the organization would be particularly valuable in a summary. This class of coreference can include people, companies and objects such as automobiles or aluminum siding. It need not be restricted to proper nouns, as it is possible to refer to an entity using common nouns, e.g., ‘the agency’, and pronouns.

Identity  also  holds  between  events  mentioned  in 
the  query  and  document.  Sometimes  the  event 
that  a  query  describes  is  the  best  indicator  of  what 
document  should  be  retrieved,  and  correspondingly 
what  sentences  are  appropriate  for  a  summary. 
Consider  the  following: 

Query:  A  relevant  document  will  provide  new 
theories  about  the  1960’s  assassination  of 
President  Kennedy. 

Document: . . . The House Assassinations Committee concluded in 1978 that Kennedy was “probably” assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission’s belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963. . . .

The noun phrase ‘the 1960’s assassination’ refers to an event, which is the same as the one referred to in the document with the verb ‘assassinated’. Note also that there is coreference between ‘President Kennedy’ and ‘Kennedy’ in the document.

The part-whole relation between the query and the document

In addition to the identity relation, phrases in a text which refer to parts of an entity or concept mentioned in the query will likely provide useful information, and therefore should be included in a summary. Finding these relations in general is beyond the scope of this paper; however, our approximation of a subclass of these relations proved helpful for a number of queries.

A strong example of the part-whole relation occurs when a country is mentioned in the query and a province or city within that country is mentioned in the document. For example:

Query: Document will discuss efforts by the black majority in South Africa to overthrow domination by the white minority government.

Document: About 90 soldiers have been arrested and face possible death sentences stemming from a coup attempt in Bophuthatswana, . . . Rebel soldiers staged the takeover bid Wednesday, detaining homeland President Lucas Mangope. . . .

Bophuthatswana is inside South Africa, and sentences that mention it are clearly good candidates for inclusion in a summary.

We also consider part-whole relations between events, as in the relation between ‘overthrow’ and ‘staged’ and ‘detained’. Those events are sub-parts of overthrow events, and as such, sentences that contain sub-parts of the events are reasonable candidates for inclusion in summaries.

Implementation 

The summarization technique was developed within the CAMP NLP framework. This system provides an integrated environment in which to access many levels of linguistic information as well as world knowledge. Its main components include: named entity recognition, tokenization, sentence detection, part-of-speech tagging, morphological analysis, parsing, argument detection, and coreference resolution. Many of the techniques used for these tasks perform at or near the state of the art and are described in more depth in [16, 12, 11, 9, 6, 2, 3]. The system produces coreference-annotated documents which serve as the input to the summarization algorithm.

Relating  the  query  to  the  document 

The relationships discussed previously are approximated via a series of associations between tokens in the query, headline, and the body of the document. Event references are captured by associating verbs or nominalizations in the query with verbs and nominalizations in the document.

Given three verbal forms, $v_1$ in the query, $v_2$ in the document, and $v_3$ in the set of all verbal forms, where a verbal form is the morphological root of a verb or the verb root corresponding to a nominalization, $v_1$ is associated with $v_2$ if at least one of the following criteria is met:

1. $(v_1 \neq v_2) \land p(v_1, v_2) / (p(v_1)\,p(v_2)) > 5$

2. $(v_1 = v_2) \land (\exists\, v_3 \neq v_1 \mid p(v_1, v_3) / (p(v_1)\,p(v_3)) > 5)$

3. $(v_1 = v_2) \land ((\mathrm{subject}(v_1) = \mathrm{subject}(v_2)) \lor (\mathrm{object}(v_1) = \mathrm{object}(v_2)))$


Here $p(v_i)$ is the probability that $v_i$ occurs in a document and $p(v_i, v_j)$ is the probability that $v_i$ and $v_j$ occur in the same document. These probabilities are based on frequencies gathered from approximately 45,000 Wall Street Journal articles. Criterion 1 is a measure of mutual information between two verbs. Criterion 2 is used to rule out frequently occurring verbs such as “be” and “make”. Criterion 3 allows for verbs which are ruled out by criterion 2 to be associated when additional context is available. This is important since some queries only contain verbal forms which are ruled out by criterion 2.
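The three criteria can be sketched in a few lines of Python. This is a minimal illustration, not the CAMP implementation: the function names, the representation of a corpus as one set of verbal forms per document, and the `same_argument` flag (standing in for the subject/object comparison of criterion 3) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_stats(docs):
    """Count document frequencies for single forms and unordered pairs.

    docs: iterable of sets of verbal forms, one set per document (assumed format).
    """
    n = 0
    single = Counter()
    pair = Counter()
    for forms in docs:
        n += 1
        single.update(forms)
        pair.update(frozenset(p) for p in combinations(sorted(forms), 2))
    return n, single, pair

def association_ratio(v1, v2, stats):
    """p(v1, v2) / (p(v1) p(v2)) for two distinct forms; 0.0 if undefined."""
    n, single, pair = stats
    if v1 == v2 or single[v1] == 0 or single[v2] == 0:
        return 0.0
    p1, p2 = single[v1] / n, single[v2] / n
    p12 = pair[frozenset((v1, v2))] / n
    return p12 / (p1 * p2)

def has_strong_partner(v, stats, threshold=5.0):
    """Criterion 2's test: some other form v3 co-occurs with v above the threshold."""
    _, single, _ = stats
    return any(association_ratio(v, v3, stats) > threshold
               for v3 in single if v3 != v)

def associated(v1, v2, stats, same_argument=False, threshold=5.0):
    """same_argument is a hypothetical flag: True when subject or object match."""
    if v1 != v2:
        return association_ratio(v1, v2, stats) > threshold        # criterion 1
    return has_strong_partner(v1, stats, threshold) or same_argument  # criteria 2 and 3

# Toy corpus: "be" occurs everywhere, so its ratio with any partner never exceeds 1.
docs = [{"overthrow", "detain", "be"}] * 2 + [{"be", "make"}] * 18
stats = cooccurrence_stats(docs)
print(associated("overthrow", "detain", stats))          # True
print(associated("be", "be", stats))                     # False
print(associated("be", "be", stats, same_argument=True))  # True
```

In the toy corpus, a verb that appears in every document can never pass the ratio test, which is the behavior criterion 2 is described as providing for “be” and “make”.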

Relationships between proper nouns are made on the basis of string matches, acronym matching, and dictionary lookup. Acronyms are determined either through a table lookup or an appositive construction occurring in the document which designates the acronym for a specific proper noun. A proper noun in the query is considered associated with a proper noun in the document if it matches the string or acronym of the proper noun in the document or it appears in the definition of the proper noun in the document. A reverse dictionary lookup often allows cities to be associated with the country they are in.
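A small sketch of these three tests follows. The lookup tables are hypothetical stand-ins: the paper does not list the contents of its acronym table or dictionary, and the gloss given for Bophuthatswana is invented for illustration. The `use_dictionary` switch is likewise an assumption, included so that the dictionary step can be turned off.

```python
# Hypothetical lookup tables; their real contents are not given in the paper.
ACRONYMS = {"FEMA": "Federal Emergency Management Agency"}
DEFINITIONS = {"Bophuthatswana": "homeland in South Africa"}  # illustrative gloss only

def normalize(name):
    """Collapse whitespace and case so string comparison is forgiving."""
    return " ".join(name.split()).lower()

def proper_nouns_associated(query_np, doc_np,
                            acronyms=ACRONYMS, definitions=DEFINITIONS,
                            use_dictionary=True):
    q, d = normalize(query_np), normalize(doc_np)
    # 1. Plain string match.
    if q == d:
        return True
    # 2. Acronym match, checked in both directions.
    for short, full in acronyms.items():
        pair = {normalize(short), normalize(full)}
        if pair == {q, d}:
            return True
    # 3. Reverse dictionary lookup: the query noun appears in the
    #    definition of the document noun.
    if use_dictionary:
        gloss = normalize(definitions.get(doc_np.strip(), ""))
        if q and q in gloss:
            return True
    return False

print(proper_nouns_associated("Federal Emergency Management Agency", "FEMA"))  # True
print(proper_nouns_associated("South Africa", "Bophuthatswana"))               # True
print(proper_nouns_associated("South Africa", "Bophuthatswana",
                              use_dictionary=False))                           # False
```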

A token in the query which is a lowercase noun or adjective is associated with any token in the document which matches its morphological root and part of speech.

Tokens which occur in the headline are associated with tokens in the document body using the same criteria as the query, with the exclusion of the dictionary lookup. The dictionary lookup was excluded because the headline will likely use the same lexicalization of a proper noun as that used in a document. This is less likely to be the case with the query.

Selecting  a  sentence 

The associations discussed in the previous section are used to rank and select sentences from the document. Every token in the document which is associated with the same token in the query or headline is considered to be in the same coreference chain. A sentence which contains any token in a given coreference chain is said to cover that chain.

The following scores are computed for each sentence in the document:

1. The number of coreference chains from the query which are covered by the sentence and haven’t been covered by a previously selected sentence.

2. The number of noun coreference chains from the query which are covered by the sentence and the number of verbal terms in the sentence which are chained to the query.

3. The number of coreference chains from the headline which are covered by the sentence and haven’t been covered by a previously selected sentence.

4. The number of noun coreference chains from the headline which are covered by the sentence and the number of verbal terms in the sentence which are chained to the headline.

5. The number of coreference chains which are covered by the sentence and haven’t been covered by a previously selected sentence.

6. The number of noun coreference chains which are covered by the sentence.

7. The index of the sentence in the document; sentences are sequentially numbered.

The sentences are sorted based on the above scores, where the ith scoring criterion is only considered in case of a tie on all criteria less than i. Scores 1-6 are ranked in descending order while score 7 is ranked in ascending order. The top-ranked sentence is selected, and scores 1, 3, and 5 are recomputed in order to select the next sentence. Selection halts when all coreference chains in the query have been covered and the summary contains at least 4 sentences.
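The ranking and halting rule can be sketched as a greedy loop over a lexicographic sort key. The following is an illustration under stated assumptions rather than the system's code: the `Sentence` fields are hypothetical, chain identifiers are assumed to be unique across query, headline and other chains, scores 2, 4 and 6 are assumed to be precomputed integers, and "all chains in the query" is taken to mean the query chains that some sentence in the document covers.

```python
from dataclasses import dataclass, field

@dataclass
class Sentence:
    index: int                                           # score 7 (ascending)
    query_chains: set = field(default_factory=set)       # basis of score 1
    headline_chains: set = field(default_factory=set)    # basis of score 3
    all_chains: set = field(default_factory=set)         # basis of score 5
    query_static: int = 0                                # score 2, precomputed
    headline_static: int = 0                             # score 4, precomputed
    noun_chains: int = 0                                 # score 6, precomputed

def select_sentences(sentences, min_sentences=4):
    covered = set()
    remaining = list(sentences)
    query_chains = set().union(*(s.query_chains for s in sentences))
    chosen = []
    # Halt once every query chain is covered AND the summary is long enough.
    while remaining and (len(chosen) < min_sentences or not query_chains <= covered):
        def key(s):
            # Negated values sort in descending order; index sorts ascending.
            return (-len(s.query_chains - covered), -s.query_static,
                    -len(s.headline_chains - covered), -s.headline_static,
                    -len(s.all_chains - covered), -s.noun_chains,
                    s.index)
        best = min(remaining, key=key)   # scores 1, 3, 5 are recomputed each round
        remaining.remove(best)
        chosen.append(best)
        covered |= best.query_chains | best.headline_chains | best.all_chains
    return sorted(chosen, key=lambda s: s.index)  # present in document order

doc = [
    Sentence(0, headline_chains={"h1"}, all_chains={"h1"}),
    Sentence(1, query_chains={"q1"}, all_chains={"q1"}, query_static=1),
    Sentence(2, query_chains={"q1", "q2"}, all_chains={"q1", "q2"}, query_static=2),
    Sentence(3),
    Sentence(4, all_chains={"x"}),
]
print([s.index for s in select_sentences(doc)])  # [0, 1, 2, 4]
```

Because a tuple comparison only consults a later element when all earlier ones tie, the sort key reproduces the "ith criterion only on ties" rule directly.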

Scores 1 and 2 are used to select sentences which are related to the query. Scores 3 and 4 are motivated by documents which have 1 or 2 sentences which appear related to the query but if presented alone would give a false impression of the true content of the document. Thus sentences related to the headline are presented to provide additional background. Consider the following example:

Query:  What  evidence  is  there  of  paramilitary 
activity  in  the  U.S.? 

Summary:  . . .  Last  month  the  extremists  used 
rocket-propelled  grenades  for  the  first  time  in 
three  attacks  on  police  and  paramilitary  units. 

This sentence was selected because it contains tokens which are in coreference chains with tokens in the query; however, alone it is potentially misleading because the place of the attack is not mentioned. This ambiguity is resolved when the following sentence is selected because it is well associated with the headline.


Summary: . . . Sikh militants may have acquired one or two U.S.-made Stinger anti-aircraft missiles and hidden them inside the Golden Temple, the Sikh faith’s holiest shrine, Punjab police officials said Saturday. . . .

This provides enough background information for the reader to realize that the paramilitary activity is not taking place in the U.S. and thus that the document is irrelevant to the query.

Likewise, scores 5 and 6 act similarly to 3 and 4 for documents which do not contain a headline. We found this particularly important for advertisements, which often don’t state a product or company name in the beginning of the document, but will repeat these names numerous times throughout the document.

Generating  the  summary 

Once sentences have been selected, they are presented in the order they occurred in the document. Pronouns which do not have a referent in the previous sentence of the summary are filled with a more descriptive string whenever a referent can be determined. If space is of concern, prepositional phrases attached to nouns (which are not nominalizations), appositives, conjoined noun phrases and relative clauses are removed, provided they contain no tokens associated with the query or the headline. Since determining pronoun referents and the selection of clauses for removal are subject to errors, filled pronouns are placed in square brackets and removed clauses are replaced with an ellipsis to indicate to the reader that the original text has been modified.

Example  summary 

An example summary which demonstrates many of the features of our system appears below. It has been constrained to be approximately 10% of the original document length, so it is not representative of the summaries used in the evaluation, but it contains examples of both pronoun filling and clause deletion.

The last sentence in the summary was selected first because the tokens “death”, “sentence”, “kill”, and “term” were associated with the nominalization “punishment”. The stranded pronoun “it” has also been filled. Sentence 2 was selected next because of the match-up between the verb “is” and the object “deterrent” in the document and the query. Finally, the first sentence was chosen because there is another mention of the prison name “Marion” in the document. This summary differs from the one generated when the 10% length constraint is not imposed, because some higher ranked sentences were passed over since their inclusion would have exceeded the length restriction.

Query:  Is  there  data  available  to  suggest  that 
capital  punishment  is  a  deterrent  to  crime? 

Summary:  “Marion  is  basically  the  end  of  the 
line,”  Bogdan  said. 

...  There  is  no  deterrent  ...  to  keep  them  from 
doing  this  again. 

Additionally, [the pending Senate bill] would create five new death penalty offenses: murder by a federal inmate serving a life sentence; drug kingpins in a continuing criminal enterprise even if no murders occur; drug kingpins who try to kill to obstruct justice; drug felons who unintentionally kill with aggravated recklessness; and people who kill with a firearm during a violent ... crime.

Evaluation 

In order to evaluate our summarization algorithm, we selected 10 unseen queries from the Text REtrieval Conference (TREC) document collection. Summaries were generated for 200 documents, 20 per query, and assessors (see footnote 2) were asked to make relevance judgments based on the summaries. A document was considered relevant if it contained the information requested in the query or if the assessor believed that the full document would likely contain this information. The relevance judgments were then compared to those made by the TREC assessors using the full document. This comparison places a summary in one of the following categories:

•  a  =  judged  relevant,  full  document  is  relevant 

•  b  =  judged  relevant,  full  document  is  irrelevant 

•  c  =  judged  irrelevant,  full  document  is  relevant 

•  d  =  judged  irrelevant,  full  document  is  irrelevant 

Precision,  recall,  and  accuracy  are  then  computed 
as  follows: 

precision  =  a/(a+b) 
recall  =  a/(a+c) 
accuracy  =  (a+d)  /  (a+b+c+d) 

Compression is computed over the number of non-whitespace characters in the summary and the original document. Here compression is defined as the percentage of the document that was not included in the summary:

compression = (length_{document} - length_{summary}) / length_{document}

Footnote 2: Each author served as an assessor making judgments for 100 documents across 10 queries.

The  results  from  our  experiment  are  shown  in  the 
following  table: 


Precision     82.8%   101/(101+21)
Recall        77.7%   101/(101+29)
Compression   82.8%   (704686-121272)/704686
Accuracy      75.0%   (101+49)/200
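As a check on the arithmetic, the figures of this first evaluation can be recomputed from the four category counts and the two character counts. The function names below are introduced for the example only; the counts a = 101, b = 21, c = 29, d = 49 and the lengths 704686 and 121272 are read off the formulas reported in the table.

```python
def relevance_metrics(a, b, c, d):
    """a, b, c, d are the category counts defined in the evaluation section."""
    return {
        "precision": a / (a + b),
        "recall": a / (a + c),
        "accuracy": (a + d) / (a + b + c + d),
    }

def compression(doc_chars, summary_chars):
    """Fraction of the document's non-whitespace characters left out of the summary."""
    return (doc_chars - summary_chars) / doc_chars

metrics = relevance_metrics(a=101, b=21, c=29, d=49)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# precision: 82.8%
# recall: 77.7%
# accuracy: 75.0%
print(f"compression: {compression(704686, 121272):.1%}")
# compression: 82.8%
```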

A second evaluation on 910 documents was performed for [5]. These results superficially appear significantly worse than those from the initial evaluation; however, a more careful analysis (provided in the discussion section) shows that they are in fact similar to the results of the previous evaluation.


Precision     80.3%   322/(322+79)
Recall        57.6%   322/(322+237)
Compression   83.0%
Accuracy      65.3%   (322+272)/910

Discussion 

We view the results of the first evaluation as promising in that they compare favorably with inter-assessor consistency using the entire document. [15] reports unanimous relevance judgments by three assessors for 71.7% of the documents. Interpolating this figure to two assessors yields an 80.1% agreement figure. Using summaries which on average are only 17.2% of the original document, our assessors matched the TREC assessors for 75.0% of the documents.
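The text does not say how the three-assessor figure was interpolated to two assessors. One reading that happens to reproduce the 80.1% figure is to treat unanimity as the product of independent per-assessor factors, so that a three-way rate of 71.7% corresponds to a two-way rate of 0.717 raised to the power 2/3. The sketch below shows that calculation; the independence model is an assumption of this example, not something stated in the source.

```python
def interpolate_agreement(unanimous_rate, from_assessors=3, to_assessors=2):
    """Rescale a unanimity rate under an assumed independent per-assessor factor."""
    per_assessor = unanimous_rate ** (1 / from_assessors)
    return per_assessor ** to_assessors

print(f"{interpolate_agreement(0.717):.1%}")  # 80.1%
```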

The second evaluation yielded a much lower recall figure while precision remained comparable. This, however, is also the case when the same assessors’ judgments on the full documents are compared to those of the TREC assessors. These results are as follows:


Precision     83.5%   167/(167+33)
Recall        63.5%   167/(167+96)
Compression   100.0%
Accuracy      69.3%   (167+124)/420

We view these results as favorable as well, since our accuracy is 65.3% using 17.0% of the document on average, compared to 69.3% accuracy using the entire document. The discrepancy between the two evaluations appears to stem from the assessors in the second evaluation using stricter criteria for relevance than those used by the previous evaluation’s assessors or the TREC assessors.

It was noted after the first evaluation that different criteria for relevance accounted for some of the disagreement between our assessors and the TREC assessors. Many documents considered relevant were marked as irrelevant due to different notions of relevance and not because the summary failed to provide material on which to base a correct decision. These difficulties only hinder the evaluation of a summary system and not its use in an application, since a user will have a clear idea of his or her intentions when determining a document’s relevance.

As we mentioned previously, our approach has been to balance methods of relating the query to sentences in the document. The nearly 100% recall of the dry-run summaries encouraged us, and we even used the output of those summaries to provide a test-bed for evaluating our summaries. Although we never actively sought to emulate aspects of other systems directly, our final algorithm does share some basic ideas and approaches with those systems. Some of the similarities are listed below:

In [4], redundant information is eliminated from summaries by classifying sentences according to Maximal Marginal Relevance (MMR). MMR ranks text chunks according to their dissimilarity to one another. Summaries can then be produced with sentences that are maximally dissimilar, thereby increasing the likelihood that distinguishing information will be in the summary. One can view our coverage requirement for terms in the query as an attempt to pick dissimilar sentences from the document. Instead of MMR, we use the fact that a sentence which does not contain phrases that refer redundantly to the query is more highly ranked than a sentence that does.

Our individual sentence scoring algorithm shares some properties with [14]. Their approach includes scores for anaphoric density, string equivalence with the title or headline of a document, and position of the sentence in the document. However, we do not take advantage of overt cues for summary sentences, such as ‘in summary’ or ‘in conclusion’, nor do we use temporal information in generating a summary.

Like many systems, we do a form of word expansion in attempting to relate the query to the document. However, the fact that we restrict expansion to proper nouns and verbs and their nominalizations is notable. We found this limited set of expansions restricts the relations between the text and the query well and also fits within the framework of part-whole relations in coreference. We did not consider part-whole relations for common nouns, because in practice we have not had very good results limiting over-generation in that domain.

In the next section we discuss a novel technology for cross-document coreference. Like the summarization system just discussed, it takes within-document coreference-annotated text, produces summaries in a very similar form to the above, and individuates entities based on the similarity of the summaries produced.

Cross-document  Coreference 

Cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source. Computer recognition of this phenomenon is important because it helps break “the document boundary” by allowing a user to examine information about a particular entity from multiple text sources at the same time. In particular, resolving cross-document coreferences allows a user to identify trends and dependencies across documents. Cross-document coreference can also be used as the central tool for producing summaries from multiple documents, and for information fusion, both of which have been identified as advanced areas of research by the TIPSTER Phase III program. Cross-document coreference was also identified as one of the potential tasks for the Sixth Message Understanding Conference (MUC-6) but was not included as a formal task because it was considered too ambitious [10].

In this paper we describe a highly successful cross-document coreference resolution algorithm which uses the Vector Space Model to resolve ambiguities between people having the same name. In addition, we describe a scoring algorithm for evaluating the cross-document coreference chains produced by our system, and we compare our algorithm to the scoring algorithm used in the MUC-6 (within-document) coreference task.

Cross-Document Coreference: The Problem

Cross-document coreference is a distinct technology from Named Entity recognizers like IsoQuest’s NetOwl and IBM’s Textract because it attempts to determine whether name matches are actually the same individual (not all John Smiths are the same). Neither NetOwl nor Textract has mechanisms which try to keep same-named individuals distinct if they are different people.

Cross-document coreference also differs in substantial ways from within-document coreference. Within a document there is a certain amount of consistency which cannot be expected across documents. In addition, the problems encountered during within-document coreference are compounded when looking for coreferences across documents because the underlying principles of linguistics and discourse context no longer apply across documents. Because the underlying assumptions in cross-document coreference are so distinct, they require novel approaches.

Architecture  and  the  Methodology 

Figure 1 shows the architecture of the cross-document system developed. The system is built upon the University of Pennsylvania’s within-document coreference system, CAMP, which participated in the Seventh Message Understanding Conference (MUC-7) within-document coreference task.

Our system takes as input the coreference-processed documents output by CAMP. It then passes these documents through the SentenceExtractor module which extracts, for each document, all the sentences relevant to a particular entity of interest. The VSM-Disambiguate module then uses a vector space model algorithm to compute similarities between the sentences extracted for each pair of documents.
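The two-stage pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: mentions are located by plain string matching instead of the coreference annotations CAMP would supply, term vectors use raw counts with no weighting, and the function names are invented for the example. The sample text is the doc.36 extract used in this section.

```python
import math
import re
from collections import Counter

def extract_summary(sentences, chain_mentions):
    """Keep sentences containing at least one mention from the chain of interest."""
    patterns = [re.compile(r"\b" + re.escape(m) + r"\b") for m in chain_mentions]
    return [s for s in sentences if any(p.search(s) for p in patterns)]

def term_vector(sentences):
    """Bag-of-words vector with raw term counts (an assumed, unweighted scheme)."""
    return Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc36 = [
    "John Perry, of Weston Golf Club, announced his resignation yesterday.",
    "He was the President of the Massachusetts Golf Association.",
    "During his two years in office, Perry guided the MGA into a closer "
    "relationship with the Women's Golf Association of Massachusetts.",
]
summary36 = extract_summary(doc36, ["John Perry", "He", "Perry"])
print(len(summary36))  # 3
```

A pairwise score between two documents would then be `cosine(term_vector(summary_a), term_vector(summary_b))`; how that score is turned into a coreference decision is not covered by this sketch.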

Details about each of the main steps of the cross-document coreference algorithm are given below.

•  First, for each article, CAMP is run on the article. It produces coreference chains for all the entities mentioned in the article. For example, consider the two extracts in Figures 2 and 4. The coreference chains output by CAMP for the two extracts are shown in Figures 3 and 5.

• Next, for the coreference chain of interest within
each article (for example, the coreference chain
that contains “John Perry”), the SentenceExtractor
module extracts all the sentences that
contain the noun phrases which form the coreference
chain. In other words, the SentenceExtractor
module produces a “summary” of the article
with respect to the entity of interest. These
summaries are a special case of the query-sensitive
techniques being developed at Penn using
CAMP. Therefore, for doc.36 (Figure 2), since
at least one of the three noun phrases (“John
Perry,” “he,” and “Perry”) in the coreference
chain of interest appears in each of the three sentences
in the extract, the summary produced by
SentenceExtractor is the extract itself. On the
other hand, the summary produced by SentenceExtractor
for the coreference chain of interest in
doc.38 is only the first sentence of the extract because
the only element of the coreference chain
appears in this sentence.


John Perry, of Weston Golf Club, announced
his resignation yesterday. He was the
President of the Massachusetts Golf Association.
During his two years in office, Perry
guided the MGA into a closer relationship
with the Women’s Golf Association of Massachusetts.


Figure 2: Extract from doc.36

Figure 3: Coreference Chains for doc.36

• For each article, the VSM-Disambiguate module
uses the summary extracted by the SentenceExtractor
and computes its similarity with the
summaries extracted from each of the other articles.
Summaries whose similarity is above a certain
threshold are considered to be about the
same entity.
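
The extraction step in the second bullet can be sketched as follows. This is only an illustrative sketch, not CAMP's actual interface: the function name `extract_summary` and the representation of a coreference chain as (sentence index, mention) pairs are assumptions made for the example. The sentences are those of the doc.36 and doc.38 extracts.

```python
def extract_summary(sentences, chain):
    """Return the sentences that hold at least one mention of the chain.

    sentences: sentence strings in document order.
    chain: (sentence_index, mention_text) pairs; a hypothetical format
           standing in for the coreference chains output by CAMP.
    """
    wanted = {index for index, _mention in chain}
    return [s for i, s in enumerate(sentences) if i in wanted]


doc36 = [
    "John Perry, of Weston Golf Club, announced his resignation yesterday.",
    "He was the President of the Massachusetts Golf Association.",
    "During his two years in office, Perry guided the MGA into a closer "
    "relationship with the Women's Golf Association of Massachusetts.",
]
doc38 = [
    'Oliver "Biff" Kelly of Weymouth succeeds John Perry as president of '
    "the Massachusetts Golf Association.",
    '"We will have continued growth in the future," said Kelly, who will '
    "serve for two years.",
    "\"There's been a lot of changes and there will be continued changes "
    'as we head into the year 2000."',
]

# Chains for the entity of interest, as described in the text.
chain36 = [(0, "John Perry"), (1, "He"), (2, "Perry")]
chain38 = [(0, "John Perry")]

summary36 = extract_summary(doc36, chain36)  # the whole extract
summary38 = extract_summary(doc38, chain38)  # first sentence only
```

Because the pronoun “He” belongs to the chain, the second sentence of doc.36 is retained even though it never mentions the name itself.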

University  of  Pennsylvania’s  CAMP 
System 

The University of Pennsylvania’s CAMP system
resolves within-document coreferences for several
different classes, including pronouns and proper
names [7]. It ranked among the top systems in the
coreference task during the MUC-6 and the MUC-7
evaluations.

The coreference chains output by CAMP enable
us to gather all the information about the entity of
interest in an article. This information about the
entity is gathered by the SentenceExtractor module
and is used by the VSM-Disambiguate module for
disambiguation purposes. Consider the extract for
doc.36 shown in Figure 2. We are able to include
the fact that the John Perry mentioned in this article
was the president of the Massachusetts Golf
Association only because CAMP recognized that
the “he” in the second sentence is coreferent with
“John Perry” in the first. And it is this fact which
actually helps VSM-Disambiguate decide that the
two John Perrys in doc.36 and doc.38 are the same
person.


Figure 1: Architecture of the Cross-Document Coreference System


Oliver “Biff” Kelly of Weymouth succeeds
John Perry as president of the Massachusetts
Golf Association. “We will have continued
growth in the future,” said Kelly, who will
serve for two years. “There’s been a lot of
changes and there will be continued changes
as we head into the year 2000.”


Figure 4: Extract from doc.38


Figure 5: Coreference Chains for doc.38

The  Vector  Space  Model 

The vector space model used for disambiguating
entities across documents is the standard vector
space model used widely in information retrieval
[13]. In this model, each summary extracted by
the SentenceExtractor module is stored as a vector
of terms. The terms in the vector are in their morphological
root form and are filtered for stop-words
(words that have no information content like a, the,
of, an, ...). If $S_1$ and $S_2$ are the vectors for the two
summaries extracted from documents $D_1$ and $D_2$,
then their similarity is computed as:

$$\mathrm{Sim}(S_1, S_2) = \sum_{\text{common terms } t_j} w_{1j} \times w_{2j}$$

where $t_j$ is a term present in both $S_1$ and $S_2$, $w_{1j}$
is the weight of the term $t_j$ in $S_1$, and $w_{2j}$ is the
weight of $t_j$ in $S_2$.

The weight of a term $t_j$ in the vector $S_i$ for a
summary is given by:

$$w_{ij} = \frac{tf \times \log\frac{N}{df}}{\sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2}}$$

where $tf$ is the frequency of the term $t_j$ in the summary,
$N$ is the total number of documents in the
collection being examined, and $df$ is the number of
documents in the collection that the term $t_j$ occurs
in. $\sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2}$ is the cosine normalization
factor and is equal to the Euclidean length of
the vector $S_i$.

The VSM-Disambiguate module, for each summary
$S_i$, computes the similarity of that summary
with each of the other summaries. If the similarity
computed is above a pre-defined threshold, then
the entities of interest in the two summaries are considered
to be coreferent.
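
A minimal sketch of the weighting, similarity, and threshold test described above is given here. It is not the VSM-Disambiguate implementation: the function names, the tiny stop-word list, the three example summaries, and the threshold value of 0.1 are all assumptions made for illustration. Reduction to morphological root form is omitted, and the collection statistics ($N$ and $df$) are taken from the summaries themselves.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "as", "at", "and", "in", "to"}


def term_counts(text):
    """Lower-cased term frequencies with punctuation and stop-words removed."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return Counter(t for t in tokens if t and t not in STOP_WORDS)


def weight_vectors(summaries):
    """One {term: weight} dict per summary, using tf * log(N / df)
    divided by the Euclidean length of the summary's vector."""
    counts = [term_counts(s) for s in summaries]
    n_docs = len(counts)
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        raw = {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        length = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / length for t, v in raw.items()} if length else {})
    return vectors


def similarity(v1, v2):
    """Sum of weight products over the terms common to both vectors."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)


def coreferent_pairs(summaries, threshold):
    """Index pairs of summaries whose similarity exceeds the threshold."""
    vectors = weight_vectors(summaries)
    return [(i, j)
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if similarity(vectors[i], vectors[j]) > threshold]


summaries = [
    "John Perry resigned as president of the Massachusetts Golf Association.",
    "Kelly succeeds John Perry as president of the Massachusetts Golf Association.",
    "John Perry teaches philosophy at Stanford.",  # invented for contrast
]
pairs = coreferent_pairs(summaries, threshold=0.1)  # hypothetical threshold
```

In this toy collection the name itself occurs in every summary, so its weight is $\log(3/3) = 0$ and it contributes nothing to the similarity. Only the surrounding context terms separate the two golf-related summaries from the third.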

Experiments 

The cross-document coreference system was tested
on a highly ambiguous test set which consisted of
197 articles from 1996 and 1997 editions of the
New York Times. The sole criterion for including
an article in the test set was the presence
of a string in the article which matched
the “/John.*?Smith/” regular expression. In other
words, all of the articles either contained the name
John Smith or contained some variation with a middle
initial/name. The system did not use any New
York Times data for training purposes. The answer
keys regarding the cross-document chains were
manually created, but the scoring was completely
automated.
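
The selection criterion can be illustrated with a short sketch. The pattern is the one quoted above with the surrounding slashes (regular-expression delimiters) removed; the function name `mentions_john_smith` and the sample strings are hypothetical.

```python
import re

# "John", then as few characters as possible, then "Smith".
JOHN_SMITH = re.compile(r"John.*?Smith")


def mentions_john_smith(article_text):
    """True if the article contains a string matching the pattern."""
    return JOHN_SMITH.search(article_text) is not None


samples = [
    "John Smith was named chairman.",    # plain name
    "John F. Smith Jr. spoke today.",    # middle initial
    "John Allen Smith coached track.",   # middle name
    "Jane Smith attended the meeting.",  # no match
]
matches = [mentions_john_smith(s) for s in samples]
```

As written here, the pattern is case-sensitive and does not match across line breaks. It would also accept any text in which “John” is followed later on the same line by “Smith”, which is one reason the answer keys still had to be created manually.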

Analysis  of  the  Data 

There were 35 different John Smiths mentioned in
the articles. Of these, 24 were each mentioned in
only one article. The other 173 articles
were about the 11 remaining John Smiths. The
backgrounds of these John Smiths, and the number
of articles pertaining to each, varied greatly.
Descriptions of a few of the John Smiths are:
the Chairman and CEO of General Motors; an assistant
track coach at UCLA; the legendary explorer;
the main character in Disney’s Pocahontas; and a former
president of the Labor Party of Britain.

Results 

Figure 6 shows the precision, recall, and F-Measure
(with equal weights for both precision and recall)
using the B-CUBED scoring algorithm. The Vector
Space Model in this case constructed the space
of terms only from the summaries extracted by
SentenceExtractor. In comparison, Figure 7 shows
the results (using the B-CUBED scoring algorithm)
when the vector space model constructed the space
of terms from the articles input to the system (it
still used the summaries when computing the similarity).
The importance of using CAMP to extract
summaries is verified by comparing the highest
F-Measures achieved by the system for the two cases.
The highest F-Measure for the former case is 84.6%
while the highest F-Measure for the latter case is
78.0%. In comparison, for this task, named-entity
tools like NetOwl and Textract would mark all the
John Smiths the same. Their performance using
our scoring algorithm is 23% precision and 100%
recall.

Figure 6: Precision, Recall, and F-Measure Using
the B-CUBED Algorithm With Training On the
Summaries

Figure 7: Precision, Recall, and F-Measure Using
the B-CUBED Algorithm With Training On Entire
Articles (plot title: Precision/Recall vs Threshold)
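
For readers unfamiliar with the scoring, the sketch below shows B-CUBED precision and recall as they are commonly defined: each article is scored by how much of its output chain is correct, and the per-article scores are averaged. Uniform weighting of the articles is an assumption here, and the five-article data set is invented for illustration. It does not reproduce the 23% figure, which depends on the actual chain sizes in the test set.

```python
def b_cubed(key, response):
    """B-CUBED precision and recall with uniform weights.

    key, response: dicts mapping every article id to a chain label
    (the manually created answer and the system output, respectively).
    """
    items = list(key)
    precision = recall = 0.0
    for item in items:
        same_response = {x for x in items if response[x] == response[item]}
        same_key = {x for x in items if key[x] == key[item]}
        correct = len(same_response & same_key)
        precision += correct / len(same_response)
        recall += correct / len(same_key)
    return precision / len(items), recall / len(items)


def f_measure(precision, recall):
    """F-Measure with equal weights for precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical answer key: three articles about one John Smith and one
# article each about two others.
key = {"a1": "gm", "a2": "gm", "a3": "gm", "a4": "coach", "a5": "explorer"}
# Baseline output that marks every John Smith as the same person.
response = {article: "same" for article in key}

p, r = b_cubed(key, response)
f = f_measure(p, r)
```

On this toy data the all-in-one baseline reaches full recall but a precision of only 11/25, the same qualitative pattern as the baseline reported above.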

Figures 8 and 9 show the precision, recall, and
F-Measure calculated using the MUC scoring algorithm.
With this algorithm, the baseline case in which all the
John Smiths are considered to be the same person
achieves 83% precision and 100% recall. The high
initial precision is mainly due to the fact that the
MUC algorithm assumes that all errors are equal.
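
The baseline figure can be checked with a short sketch of the link-based MUC score in its standard formulation, assuming each article contributes one John Smith mention. The split of the 173 multi-article documents among the 11 John Smiths is not given above, so the chain sizes below are hypothetical. The baseline precision depends only on the totals (197 articles, 35 answer chains), not on that split.

```python
def muc_recall(key_chains, response_chains):
    """Link-based MUC recall of the response against the key.

    For each key chain S, count the response chains needed to cover it,
    |p(S)|, and return sum(|S| - |p(S)|) / sum(|S| - 1).
    Swap the arguments to obtain precision.
    """
    owner = {item: index
             for index, chain in enumerate(response_chains)
             for item in chain}
    numerator = denominator = 0
    for chain in key_chains:
        # Items absent from the response count as their own singleton part.
        parts = {owner.get(item, ("unresolved", item)) for item in chain}
        numerator += len(chain) - len(parts)
        denominator += len(chain) - 1
    # Undefined when every key chain is a singleton; reported as 0.0 here.
    return numerator / denominator if denominator else 0.0


# 24 single-article John Smiths plus 11 others covering 173 articles.
sizes = [1] * 24 + [2] * 10 + [153]  # hypothetical split, sums to 197
articles = iter(range(197))
key_chains = [[next(articles) for _ in range(n)] for n in sizes]

# Baseline: every article placed in one chain.
baseline = [list(range(197))]

precision = muc_recall(baseline, key_chains)  # (197 - 35) / (197 - 1)
recall = muc_recall(key_chains, baseline)
```

Under these assumptions the baseline precision is 162/196, roughly 83%, with 100% recall. Merging 35 distinct people into one chain costs only 34 of 196 links, which illustrates why treating all errors as equal gives such a high starting precision.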

We have also tested our system on other classes
of cross-document coreference, such as names of companies
and events. Details about these experiments
can be found in [1].

Figure 8: Precision, Recall, and F-Measure Using
the MUC Algorithm With Training On the Summaries
(plot title: Precision/Recall vs Threshold)

Figure 9: Precision, Recall, and F-Measure Using
the MUC Algorithm With Training On Entire Articles
(plot title: Precision/Recall vs Threshold)

Conclusions 

The  TIPSTER  phase  III  program  has  allowed  us  to 
explore  some  of  the  potential  application  areas  of 
coreference  annotation.  We  have  reported  on  our 
strongest  results,  a  summarization  system  and  a 
cross-document  coreference  system  for  names. 

The query-sensitive text summarization system
is nearly as effective as full text documents for
determining whether a document is relevant to
the query. The system uses a limited class of
coreference-based relations between the query and
the document to select sentences which represent
instantiations of entities, events, or concepts articulated
in the query.

As a novel research problem, cross-document
coreference provides a different perspective from
related phenomena like named entity recognition
and within-document coreference. Our system
takes summaries about an entity of interest and
uses various information retrieval metrics to rank
the similarity of the summaries. We found it quite
challenging to arrive at a scoring metric that satisfied
our intuitions about what was good system
output versus bad, but we have developed a scoring
algorithm that is an improvement for this class of
data over other within-document coreference scoring
algorithms. Our results are quite encouraging,
with potential performance being as good as 84.6%
(F-Measure).

Future  Goals 

Central to the future of this research program is
the CAMP software system. We are continually refining
and extending the software to better capture
the coreference relations that we need and to reduce
genre-dependent aspects of the system. We
are currently exploring visualization interfaces to
both within-document and cross-document coreference, which
we believe will provide strong motivation for the importance
of coreference annotation of free text databases.
In addition, we are interested in generating
cross-document summaries based on techniques similar
to those of our within-document summarization system.